Main Processing:

As technology progresses in this age, using data to address every type of issues has started to become important. In this report we attempt to use a data-driven approach to enhance safety in the city of Colchester using two datasets. By analysing the patterns and correlation between various crimes and the whether condition, we may be able to identify patterns to devise measures to increase safety and reduce crimes by analysing the data to help police prioritise the important cases with focused effort, rather than accidentally putting more focus on cases with less significant importance.

#Libraries
library(stringr)
library(knitr)
library(lubridate)
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
library(DT)
library(plotly)
## Loading required package: ggplot2
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
library(viridis)
## Loading required package: viridisLite
library(ggcorrplot)
library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
## The following object is masked from 'package:plotly':
## 
##     select
library(leaflet)
library(leaflet.extras)
#set working directory
getwd()
## [1] "/Users/praj/Documents/Lectures/Data Viz"
setwd("/Users/praj/Documents/Lectures/Data Viz")

#reading the data
rm(list=ls())
initial_crime <- read.csv("crime24.csv")
initial_temp <- read.csv("temp24.csv")
#View the structure and dimension of data:
dim(initial_crime)
## [1] 6304   13
dim(initial_temp)
## [1] 366  18
str(initial_crime)
## 'data.frame':    6304 obs. of  13 variables:
##  $ X               : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ category        : chr  "anti-social-behaviour" "anti-social-behaviour" "anti-social-behaviour" "anti-social-behaviour" ...
##  $ persistent_id   : chr  "" "" "" "" ...
##  $ date            : chr  "2024-01" "2024-01" "2024-01" "2024-01" ...
##  $ lat             : num  51.9 51.9 51.9 51.9 51.9 ...
##  $ long            : num  0.901 0.899 0.902 0.888 0.89 ...
##  $ street_id       : int  2153130 2153105 2153147 2152856 2152871 2153107 2152963 2152963 2153186 2153163 ...
##  $ street_name     : chr  "On or near Middle Mill" "On or near Conference/exhibition Centre" "On or near Mason Road" "On or near Kensington Road" ...
##  $ context         : logi  NA NA NA NA NA NA ...
##  $ id              : int  115967607 115967129 115967591 115967062 115967058 115967547 115967516 115967638 115967128 115967378 ...
##  $ location_type   : chr  "Force" "Force" "Force" "Force" ...
##  $ location_subtype: chr  "" "" "" "" ...
##  $ outcome_status  : chr  NA NA NA NA ...
str(initial_temp)
## 'data.frame':    366 obs. of  18 variables:
##  $ station_ID     : int  3590 3590 3590 3590 3590 3590 3590 3590 3590 3590 ...
##  $ Date           : chr  "2024-12-31" "2024-12-30" "2024-12-29" "2024-12-28" ...
##  $ TemperatureCAvg: num  6.5 5.6 3.3 4 5.3 6.7 9.4 4.3 4.6 7.2 ...
##  $ TemperatureCMax: num  7.7 6.9 4.9 5.8 6.7 10 12.3 6.9 7.9 11 ...
##  $ TemperatureCMin: num  5 3.4 2.2 2.3 4.3 5.6 3.5 2.5 2.5 3.3 ...
##  $ TdAvgC         : num  4.4 4.9 3.2 3.7 5.1 6.4 8.8 1.8 -0.5 4.5 ...
##  $ HrAvg          : num  86.4 94.9 98.6 98.4 98.4 98.3 95.6 84.2 70 83 ...
##  $ WindkmhDir     : chr  "WSW" "WSW" "W" "SW" ...
##  $ WindkmhInt     : num  22.7 16.7 11.4 5.5 6.3 9.3 15.4 16.4 36.8 28 ...
##  $ WindkmhGust    : num  42.6 40.8 22.2 14.8 16.7 22.2 31.5 50 70.4 66.7 ...
##  $ PresslevHp     : num  1025 1028 1028 1032 1035 ...
##  $ Precmm         : num  0 0 0.4 0.4 0.4 0.4 0 0 0.8 0.8 ...
##  $ TotClOct       : num  4.5 8 8 8 8 8 6.8 6.7 4.3 6.6 ...
##  $ lowClOct       : num  7.2 8 8 8 8 8 6.8 7.6 5.2 6.9 ...
##  $ SunD1h         : num  5.7 0 0 0 0 0 0 1.4 2.8 0 ...
##  $ VisKm          : num  63.4 15.3 0.5 0.1 0.5 0.2 13.3 20 38.8 34.9 ...
##  $ SnowDepcm      : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ PreselevHp     : logi  NA NA NA NA NA NA ...
# Check for missing values
sum(is.na(initial_crime))
## [1] 7014
print(sum(is.na(initial_crime)))
## [1] 7014
sum(is.na(initial_temp))
## [1] 759
print(sum(is.na(initial_temp)))
## [1] 759

Data Preparation and cleaning:

Data preparation is an important step in the analytical process, to ensure integrity and reliability of data sources. The raw/initial crime dataset underwent a cleaning process to address missing values, standardise formats, and remove irrelevant/non-important information. Numeric columns were checked as well, ensuring that missing values were replaced with appropriate values(like mean values). Textual data(like Street Names) is converted to a consistent format for accurate analysis and visualization. We have removed the context and location sub-type columns as they are either empty or have NA values.

Similarly, the temp(we can also consider it the weather data) dataset also underwent cleaning process to ensure its consistency and completeness. Inconsistent variables were spotted and removed/replaced, making the dataset proper for analysis.

#Setting new variable for cleaned dataset
cleaned_crime <- initial_crime

#Listing numeric columns
num_col_crime <- sapply(cleaned_crime, is.numeric)

#Replacing NA values with mean
cleaned_crime[num_col_crime] <- lapply(cleaned_crime[num_col_crime], function(x) {
    ifelse(is.na(x), round(mean(x, na.rm = TRUE), 1), x)
})

#Data cleaning in cleaned set:
#Filling missing values in outcome_status
cleaned_crime$outcome_status[is.na(cleaned_crime$outcome_status)] <- "No Information available"
cleaned_crime$street_name <- str_trim(str_to_lower(cleaned_crime$street_name))

cleaned_crime$date <- ym(cleaned_crime$date)

#Remove irrelevant data columns
cleaned_crime <- subset(cleaned_crime, select = -c(context, location_subtype))

##Repeating process for temp data
cleaned_temp <- initial_temp
num_col_temp <- sapply(cleaned_temp, is.numeric)
cleaned_temp[num_col_temp] <- lapply( cleaned_temp[num_col_temp], 
                                     function(x) {
                                       ifelse(is.na(x), round(mean(x, na.rm = TRUE), 1), x)  
                                      }
                                   )

cleaned_temp$Date <- ymd(cleaned_temp$Date)

cleaned_temp <- cleaned_temp[, !names(cleaned_temp) %in% c("PreselevHp", "SnowDepcm")]

dim(cleaned_crime)
## [1] 6304   11
head(cleaned_crime)
##   X              category persistent_id       date      lat     long street_id
## 1 1 anti-social-behaviour               2024-01-01 51.89301 0.901028   2153130
## 2 2 anti-social-behaviour               2024-01-01 51.88979 0.898830   2153105
## 3 3 anti-social-behaviour               2024-01-01 51.89825 0.902107   2153147
## 4 4 anti-social-behaviour               2024-01-01 51.87837 0.888373   2152856
## 5 5 anti-social-behaviour               2024-01-01 51.87905 0.889521   2152871
## 6 6 anti-social-behaviour               2024-01-01 51.88860 0.899203   2153107
##                               street_name        id location_type
## 1                  on or near middle mill 115967607         Force
## 2 on or near conference/exhibition centre 115967129         Force
## 3                   on or near mason road 115967591         Force
## 4              on or near kensington road 115967062         Force
## 5                 on or near lambeth road 115967058         Force
## 6               on or near trinity street 115967547         Force
##             outcome_status
## 1 No Information available
## 2 No Information available
## 3 No Information available
## 4 No Information available
## 5 No Information available
## 6 No Information available
dim(cleaned_temp)
## [1] 366  16
head(cleaned_temp)
##   station_ID       Date TemperatureCAvg TemperatureCMax TemperatureCMin TdAvgC
## 1       3590 2024-12-31             6.5             7.7             5.0    4.4
## 2       3590 2024-12-30             5.6             6.9             3.4    4.9
## 3       3590 2024-12-29             3.3             4.9             2.2    3.2
## 4       3590 2024-12-28             4.0             5.8             2.3    3.7
## 5       3590 2024-12-27             5.3             6.7             4.3    5.1
## 6       3590 2024-12-26             6.7            10.0             5.6    6.4
##   HrAvg WindkmhDir WindkmhInt WindkmhGust PresslevHp Precmm TotClOct lowClOct
## 1  86.4        WSW       22.7        42.6     1025.3    0.0      4.5      7.2
## 2  94.9        WSW       16.7        40.8     1028.5    0.0      8.0      8.0
## 3  98.6          W       11.4        22.2     1028.5    0.4      8.0      8.0
## 4  98.4         SW        5.5        14.8     1031.8    0.4      8.0      8.0
## 5  98.4          S        6.3        16.7     1034.7    0.4      8.0      8.0
## 6  98.3        WSW        9.3        22.2     1033.6    0.4      8.0      8.0
##   SunD1h VisKm
## 1    5.7  63.4
## 2    0.0  15.3
## 3    0.0   0.5
## 4    0.0   0.1
## 5    0.0   0.5
## 6    0.0   0.2

Validating data after cleaning:

The structure and summary of the data was seen and we confirmed that there are no more NA values in the table since the is.na function returned value 0.

#View the structure of the data
str(cleaned_crime)
## 'data.frame':    6304 obs. of  11 variables:
##  $ X             : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ category      : chr  "anti-social-behaviour" "anti-social-behaviour" "anti-social-behaviour" "anti-social-behaviour" ...
##  $ persistent_id : chr  "" "" "" "" ...
##  $ date          : Date, format: "2024-01-01" "2024-01-01" ...
##  $ lat           : num  51.9 51.9 51.9 51.9 51.9 ...
##  $ long          : num  0.901 0.899 0.902 0.888 0.89 ...
##  $ street_id     : int  2153130 2153105 2153147 2152856 2152871 2153107 2152963 2152963 2153186 2153163 ...
##  $ street_name   : chr  "on or near middle mill" "on or near conference/exhibition centre" "on or near mason road" "on or near kensington road" ...
##  $ id            : int  115967607 115967129 115967591 115967062 115967058 115967547 115967516 115967638 115967128 115967378 ...
##  $ location_type : chr  "Force" "Force" "Force" "Force" ...
##  $ outcome_status: chr  "No Information available" "No Information available" "No Information available" "No Information available" ...
str(cleaned_temp)
## 'data.frame':    366 obs. of  16 variables:
##  $ station_ID     : int  3590 3590 3590 3590 3590 3590 3590 3590 3590 3590 ...
##  $ Date           : Date, format: "2024-12-31" "2024-12-30" ...
##  $ TemperatureCAvg: num  6.5 5.6 3.3 4 5.3 6.7 9.4 4.3 4.6 7.2 ...
##  $ TemperatureCMax: num  7.7 6.9 4.9 5.8 6.7 10 12.3 6.9 7.9 11 ...
##  $ TemperatureCMin: num  5 3.4 2.2 2.3 4.3 5.6 3.5 2.5 2.5 3.3 ...
##  $ TdAvgC         : num  4.4 4.9 3.2 3.7 5.1 6.4 8.8 1.8 -0.5 4.5 ...
##  $ HrAvg          : num  86.4 94.9 98.6 98.4 98.4 98.3 95.6 84.2 70 83 ...
##  $ WindkmhDir     : chr  "WSW" "WSW" "W" "SW" ...
##  $ WindkmhInt     : num  22.7 16.7 11.4 5.5 6.3 9.3 15.4 16.4 36.8 28 ...
##  $ WindkmhGust    : num  42.6 40.8 22.2 14.8 16.7 22.2 31.5 50 70.4 66.7 ...
##  $ PresslevHp     : num  1025 1028 1028 1032 1035 ...
##  $ Precmm         : num  0 0 0.4 0.4 0.4 0.4 0 0 0.8 0.8 ...
##  $ TotClOct       : num  4.5 8 8 8 8 8 6.8 6.7 4.3 6.6 ...
##  $ lowClOct       : num  7.2 8 8 8 8 8 6.8 7.6 5.2 6.9 ...
##  $ SunD1h         : num  5.7 0 0 0 0 0 0 1.4 2.8 0 ...
##  $ VisKm          : num  63.4 15.3 0.5 0.1 0.5 0.2 13.3 20 38.8 34.9 ...
#View summary of the data
summary(cleaned_crime)
##        X          category         persistent_id           date           
##  Min.   :   1   Length:6304        Length:6304        Min.   :2024-01-01  
##  1st Qu.:1577   Class :character   Class :character   1st Qu.:2024-03-01  
##  Median :3152   Mode  :character   Mode  :character   Median :2024-07-01  
##  Mean   :3152                                         Mean   :2024-06-15  
##  3rd Qu.:4728                                         3rd Qu.:2024-09-01  
##  Max.   :6304                                         Max.   :2024-12-01  
##       lat             long          street_id       street_name       
##  Min.   :51.88   Min.   :0.8788   Min.   :2152686   Length:6304       
##  1st Qu.:51.89   1st Qu.:0.8966   1st Qu.:2153025   Class :character  
##  Median :51.89   Median :0.9013   Median :2153155   Mode  :character  
##  Mean   :51.89   Mean   :0.9029   Mean   :2153873                     
##  3rd Qu.:51.89   3rd Qu.:0.9088   3rd Qu.:2153366                     
##  Max.   :51.90   Max.   :0.9246   Max.   :2343256                     
##        id            location_type      outcome_status    
##  Min.   :115954844   Length:6304        Length:6304       
##  1st Qu.:118009952   Class :character   Class :character  
##  Median :120228058   Mode  :character   Mode  :character  
##  Mean   :120403000                                        
##  3rd Qu.:122339060                                        
##  Max.   :125550731
summary(cleaned_temp)
##    station_ID        Date            TemperatureCAvg TemperatureCMax
##  Min.   :3590   Min.   :2024-01-01   Min.   :-2.60   Min.   : 1.10  
##  1st Qu.:3590   1st Qu.:2024-04-01   1st Qu.: 7.00   1st Qu.:10.72  
##  Median :3590   Median :2024-07-01   Median :10.95   Median :14.75  
##  Mean   :3590   Mean   :2024-07-01   Mean   :10.98   Mean   :15.08  
##  3rd Qu.:3590   3rd Qu.:2024-09-30   3rd Qu.:14.50   3rd Qu.:19.60  
##  Max.   :3590   Max.   :2024-12-31   Max.   :23.10   Max.   :29.80  
##  TemperatureCMin      TdAvgC           HrAvg        WindkmhDir       
##  Min.   :-6.100   Min.   :-6.000   Min.   :59.60   Length:366        
##  1st Qu.: 3.325   1st Qu.: 4.725   1st Qu.:75.90   Class :character  
##  Median : 6.800   Median : 8.200   Median :82.75   Mode  :character  
##  Mean   : 6.486   Mean   : 7.752   Mean   :81.74                     
##  3rd Qu.: 9.500   3rd Qu.:11.000   3rd Qu.:88.80                     
##  Max.   :16.700   Max.   :16.900   Max.   :98.60                     
##    WindkmhInt     WindkmhGust       PresslevHp         Precmm      
##  Min.   : 3.90   Min.   : 11.10   Min.   : 978.9   Min.   : 0.000  
##  1st Qu.:12.22   1st Qu.: 31.50   1st Qu.:1007.5   1st Qu.: 0.000  
##  Median :15.80   Median : 38.90   Median :1013.8   Median : 0.200  
##  Mean   :16.52   Mean   : 40.81   Mean   :1013.7   Mean   : 1.866  
##  3rd Qu.:19.80   3rd Qu.: 48.20   3rd Qu.:1021.0   3rd Qu.: 1.900  
##  Max.   :42.50   Max.   :105.60   Max.   :1037.3   Max.   :38.000  
##     TotClOct        lowClOct         SunD1h           VisKm      
##  Min.   :0.000   Min.   :1.000   Min.   : 0.000   Min.   : 0.10  
##  1st Qu.:3.800   1st Qu.:5.900   1st Qu.: 0.325   1st Qu.:20.73  
##  Median :5.600   Median :6.900   Median : 3.500   Median :30.95  
##  Mean   :5.304   Mean   :6.609   Mean   : 4.203   Mean   :31.42  
##  3rd Qu.:7.200   3rd Qu.:7.600   3rd Qu.: 7.100   3rd Qu.:41.20  
##  Max.   :8.000   Max.   :8.000   Max.   :15.600   Max.   :71.20
#Check for missing values to reconfirm removal
sum(is.na(cleaned_crime))
## [1] 0
sum(is.na(cleaned_temp))
## [1] 0

Analysing the crime dataset:

We checked the category of crime and its frequency. Our dataset contained a total of 14 categories of crimes. The crime having the highest frequency of 2420 was violent crimes and the one with the lowest at 65 was Possession of weapons. We will also be creating a two-way table for analysing different crimes and their outcomes.

#Creating new table for frequency:
crime_freq <- as.data.frame(table(cleaned_crime$category))

#Originally, columns are named as var1 and freq, so we will be renaming them
colnames(crime_freq) <- c("Category", "Crime Frequency")

#Now we calculate the frequency of the crimes to find out which is the highest committed offence
max_freq <- crime_freq[which.max(crime_freq$`Crime Frequency`), ]
max_freq
##         Category Crime Frequency
## 14 violent-crime            2420
#Calculating crime percentage
crime_freq <- crime_freq %>%
  arrange(desc(`Crime Frequency`)) %>%
  mutate(Percentage = round((`Crime Frequency` / sum(`Crime Frequency`)) * 100, 2))

datatable(crime_freq, options = list(pageLength = 10))
#Creating a two-way table for crime category and its outcome
category_outcome <- table(cleaned_crime$category, cleaned_crime$outcome_status)
kable(category_outcome)
Action to be taken by another organisation Awaiting court outcome Court result unavailable Formal action is not in the public interest Further action is not in the public interest Further investigation is not in the public interest Investigation complete; no suspect identified Local resolution No Information available Offender given a caution Status update unavailable Suspect charged as part of another case Unable to prosecute suspect Under investigation
anti-social-behaviour 0 0 0 0 0 0 0 0 710 0 0 0 0 0
bicycle-theft 3 3 2 0 0 0 113 0 0 0 2 0 16 10
burglary 1 10 6 0 0 0 108 0 0 0 7 0 26 13
criminal-damage-arson 6 28 34 0 1 0 246 9 0 11 10 0 110 24
drugs 2 13 19 7 4 0 18 118 0 16 9 0 16 43
other-crime 4 7 12 2 1 1 14 1 0 1 5 0 43 9
other-theft 1 6 5 2 0 0 283 1 0 0 12 1 74 27
possession-of-weapons 2 8 9 0 1 0 6 4 0 4 8 0 16 7
public-order 8 20 20 11 0 0 165 12 0 3 27 0 160 32
robbery 0 7 4 1 0 0 41 0 0 0 4 0 25 3
shoplifting 5 82 65 2 0 0 313 25 0 1 7 0 92 37
theft-from-the-person 2 2 0 0 0 0 66 0 0 0 2 0 14 5
vehicle-crime 1 7 6 0 0 0 208 0 0 0 3 0 33 12
violent-crime 84 96 107 12 7 1 446 31 0 20 141 0 1195 280

Creating plots for various analysis:

Plotting of various datasets is important to figure out the correlation of data or to find patterns. In our case, plotting has shown that violent crimes and anti-social behavior are the most common crimes in Colchester in 2024. The pie chart shows that 38.4% of total crimes is just violent crimes, which I believe does warrant more study as to why it is so. More study of external factors, not just the weather, such as Income, employability, access to mental health support can be studied to find out the root cause.

#pie chart for various crimes category and their percentage
plot_ly(crime_freq, labels = ~Category, values = ~Percentage, type = 'pie',
        textinfo = 'label+percent',
        insidetextorientation = 'radial',
        marker = list(line = list(color = '#FFFFFF', width = 0.5))) %>%
  layout(title = "Crime Distribution by Category - Percentages")

By hovering over the bar plot, we can see the exact number of crimes of that Category that have been committed. This is for easier understanding of the numbers as, at first glance, we can see the general number of crimes to gauge the severity and then we can hover over the bars for more details.

The second bar plot, we can see that the most common place for crimes to happen are in or near supermarkets, which we can attribute to being crowded places, followed by shopping areas.

#Interactive bar plot for category to frequency analysis
plot_ly(crime_freq, x = ~reorder(Category, -`Crime Frequency`), y = ~`Crime Frequency`, type = 'bar',
        marker = list(color = 'lightblue')) %>% 
        layout(title = "Interactive Crime Frequency by Category", xaxis = list(title = ""), 
        yaxis = list(title = "Frequency"))
#Bar plot to check the crime distribution in Colchester
top_5_crime_places <- cleaned_crime %>%
  count(street_name, sort = TRUE) %>%
  slice_max(n, n = 5)

ggplot(top_5_crime_places, aes(x = reorder(street_name, n), y = n)) +
  geom_col(fill = "darkred") +
  coord_flip() +
  labs(title = "Top 5 places where crimes happen in Colchester", x = "Place", y = "Number of Crimes")

For the 2D density plot, the darker colors indicate the higher crime rates in that area.We can observe quite the contrast at the centre of the plot.

For the most common crimes, which we will be considering as the top 4 crimes, most of their outcomes were as follows : For anti-social behavior - No information available for all the cases. For criminal damage arson and shoplifting, we have the most common outcome as Investigation complete, no suspect identified, while for violent crimes, it is Unable to prosecute. From all these outcomes we can see that for the most commonly committed crimes, there are no severe repercutions/outcomes which would work as a deterrant for offenders commiting similar crimes. Bar plots are easy to read and with the shortened forms for the various outcomes, they are easy to understand as well. Setting the colors on our makes makes the readability of the plots better.

#2D Density plot for crimes
ggplot(cleaned_crime, aes(x = long, y = lat)) +
  stat_density_2d(aes(fill = ..level..), geom = "polygon", contour = TRUE) +
  scale_fill_viridis_c() +
  labs(title = "2D Density Plot: Crimes", x = "Longitude", y = "Latitude") +
  theme_minimal() +
  coord_fixed()
## Warning: The dot-dot notation (`..level..`) was deprecated in ggplot2 3.4.0.
## ℹ Please use `after_stat(level)` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
#Bar plot to check top crime category and outcomes
crime_outcome <- cleaned_crime %>%
  group_by(category, outcome_status) %>%
  summarise(count = n(), .groups = 'drop')

top_categories <- crime_outcome %>%
  group_by(category) %>%
  summarise(total_count = sum(count)) %>%
  arrange(desc(total_count)) %>%
  head(4) %>%
  pull(category)

crime_outcome_top <- crime_outcome %>%
  filter(category %in% top_categories)

#Shortening the labels because they are too big for x-axis
crime_outcome_top <- crime_outcome_top %>%
  mutate(outcome_shortened = case_when(
    outcome_status == "No Information available" ~ "NI",
    outcome_status == "Investigation complete; no suspect identified" ~ "IC",
    outcome_status == "Court result unavailable" ~ "CRU",
    outcome_status == "Status update unavailable" ~ "SUU",
    outcome_status == "Unable to prosecute suspect" ~ "UP",
    outcome_status == "Local resolution" ~ "LR",
    outcome_status == "Further action is not in the public interest" ~ "NAct",
    outcome_status == "Action to be taken by another organisation" ~ "AO",
    outcome_status == "Offender given a caution" ~ "OC",
    outcome_status == "Awaiting court outcome" ~ "ACO",
    outcome_status == "Formal action is not in the public interest" ~ "NoA",
    outcome_status == "Suspect charged as part of another case" ~ "CAC",
    outcome_status == "Further investigation is not in the public interest" ~ "NFI",
    outcome_status == "Under investigation" ~ "UI",
    TRUE ~ outcome_status
  ))

ggplot(crime_outcome_top, aes(x = reorder(outcome_shortened, -count), y = count, fill = outcome_shortened)) +
  geom_bar(stat = "identity") +
  facet_wrap(~category, scales = "free_x") +
  theme(axis.text.x = element_blank()) +
  labs(title = "Crime Outcomes according to Top 4 Categories", x = "Outcome", y = "Count") +
  theme_minimal() +
  scale_fill_manual(values = c(
    "NI" = "grey",
    "IC" = "lightblue",
    "CRU" = "lightgreen",
    "SUU" = "lightpink",
    "UP" = "lightyellow",
    "LR" = "lightcoral",
    "NAct" = "lightgray",
    "AO" = "lightsalmon",
    "OC" = "lightseagreen",
    "ACO" = "lightsteelblue",
    "NoA" = "lightgoldenrodyellow",
    "CAC" = "violet",
    "NFI" = "darkorange",
    "UI" = "salmon"
  ))

The interactive box plot helps to check what the outcome was for a specific crime. Hovering over the plot will show the exact outcomes of the crime after investigation.

#Box plot to display outcomes for crimes:
category_colors <- c(
  "anti-social-behaviour" = "red", 
  "bicycle-theft" = "skyblue", 
  "burglary" = "lightgreen", 
  "criminal-damage-arson" = "purple", 
  "drugs" = "orange", 
  "other-theft" = "yellow", 
  "possession-of-weapons" = "pink", 
  "public-order" = "cyan", 
  "robbery" = "brown", 
  "shoplifting" = "magenta", 
  "theft-from-the-person" = "lightblue", 
  "vehicle-crime" = "darkgreen", 
  "violent-crime" = "darkred", 
  "other-crime" = "blue"
)
plot_ly(cleaned_crime, y = ~outcome_status, type = "box", color = ~category, colors = category_colors) %>%
  layout(title = "Box Plot: Outcomes by Crime Category")

Checking crimes with respect to the weather(temp) information provided:

According to the time-series plot, we can see that as the the temperature spiked then dropped, there was an increase in the crimes committed. Whereas, during the time periods where the the temperature is relatively stable or without high fluctuation, we can observe that there are less crimes. From this we can conclude that the drastic change in weather is a factor for increased crimes. All of this can be interpreted from our time-series plot, which is a type of plot that shows how something changes over time.

#Doing a time-series plot for crime count with respect to temperature
temp_for_crime <- cleaned_crime %>%
  mutate(date = as.Date(date)) %>%
  group_by(date) %>%
  summarise(crime_count = n()) %>%
  left_join(cleaned_temp, by = c("date" = "Date"))

scaling_factor = 10
ggplot(temp_for_crime, aes(x = date)) +
  geom_line(aes(y = crime_count), color = "blue", linewidth = 1) +
  geom_line(aes(y = TemperatureCAvg * scaling_factor), color = "red", linetype = "dashed", linewidth = 1) +
  scale_y_continuous(
    name = "Crime Count",
    sec.axis = sec_axis(~ . / scaling_factor, name = "Average Temperature (°C)")
  ) +
  labs(title = "Time-Series of Crime Count and Temperature", x = "Date") +
  theme_minimal()

Correlation analysis:

For our correlation plot, the values will be ranging from +1 to -1 where +1 indicates perfect positive correlation which means that as one value increases(temperature) the other(crimes) also increase and -1 means negative correlation, that is, temperature and crimes are inversely proportional. 0 means there is no correlation. From our bar plot, we can see that the number of crimes are higher in Summer, but with violent crimes being higher in count in Winter

#Checking if there is any correlatin between various crimes and the weather
crime_tempData_correlation <- temp_for_crime %>%
  dplyr::select(
    Crimes = crime_count,
    Avg_temp = TemperatureCAvg,
    Max_temp = TemperatureCMax,
    Min_temp = TemperatureCMin,
    Hour_avg = HrAvg
    ) %>%
  cor(use = "complete.obs")

ggcorrplot(crime_tempData_correlation, lab = TRUE)

#Analysing crimes by seasons:
cleaned_crime$season <- case_when(
  month(cleaned_crime$date) %in% 3:5 ~ "Spring",
  month(cleaned_crime$date) %in% 6:8 ~ "Summer",
  month(cleaned_crime$date) %in% 9:11 ~ "Autumn",
  TRUE ~ "Winter"
)


#Summarising crime counts by season and category
crime_by_season <- cleaned_crime %>%
  group_by(season, category) %>%
  summarise(crime_count = n(), .groups = "drop")

#Define custom colors for seasons (optional)
season_colors <- c("Spring" = "Green",
                   "Summer" = "#Orange",
                   "Autumn" = "#DarkOrange",
                   "Winter" = "#Blue")

#plotting stacked bar chart
ggplot(crime_by_season, aes(x = season, y = crime_count, fill = category)) +
  geom_col(position = "stack") +
  scale_fill_viridis_d(option = "C", name = "Crime Category") +  # Distinct colors for categories
  scale_x_discrete(drop = FALSE) +  # Ensure all seasons are shown
  labs(title = "Crime Distribution by Season & Category", x = "Season", y = "Number of Crimes") +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 16, face = "bold", hjust = 0.5),
    axis.title = element_text(size = 12),
    axis.text = element_text(size = 10),
    legend.position = "right"
  )

Heatmap to see crime hotspots:

A heatmap on a map shows where, in our instance, a crime is happening in more intensity(more counts of crime) using the color gradient. The places with reddish-orange hue indicate that there are more crimes in that area as compared to the lower number of crimes indicated by the pale blue-green colors. By using the leaflet package, out heatmap is interactive, which means that we are able to zoom, click, explore into a specific region to explore the data in depth upto a certain level.

crime_heatmap <- cleaned_crime %>% filter(!is.na(lat) & !is.na(long))

leaflet(data = crime_heatmap) %>% addTiles() %>% addHeatmap(lng = ~long, lat = ~lat, radius = 10)

Leaflet for violent crimes:

Similar to the heatmap using leaflet, we care checking the highest value’s category over a map. Observing this interactive map, we can conclude that most of the violent crimes are committed more at the city centre than its surrounding areas.

violent_crimes <- cleaned_crime %>% filter(category == "violent-crime")
violent_map <- violent_crimes[!is.na(violent_crimes$lat) & !is.na(violent_crimes$long), ]

leaflet(data = violent_map) %>%
  addTiles() %>%
  addCircleMarkers(lng = ~long, lat = ~lat,
                   radius = 3, color = "blue", stroke = FALSE, fillOpacity = 0.4,
                   popup = ~paste(category, outcome_status))

Conclusion:

From our analysis we can conclude that the weather did play an important role in influencing the crime trends in Colchester in 2024. We also saw that most of the overall crimes were committed in Summer while most violent crimes are committed in Winter. There are significantly more crimes committed around the city centre than its surrounding areas. Also, there were no major repercutions for the top 4 committed crimes in Colchester in 2024. If we had a bigger dataset, specifically a dataset spanning over 2 to 4 years we may have been able to understand the trends better and then, on analysis of those trends we/the police may have been able to devise a strategy to focus their efforts on curbing those crimes in a better and efficient manner.

References:

  1. R for Data Science. O’Reilly Media.
  2. Wickham, H. (2016). ggplot2: Elegant Graphics for Data Analysis. Springer.
  3. Grolemund, G., & Wickham, H. (2011). Dates and Times Made Easy with lubridate. Journal of Statistical Software, 40(3), 1–25.
  4. Wickham, H. (2019). Advanced R (2nd ed.). Chapman and Hall/CRC.
  5. Sites : DataCamp tutorials, Hackr.io for visualisation practice/learning.
  6. Sancho, J. L. V., Domínguez, J. C., & et al. (2014). An approach to the taxonomy of data visualization.